Achieving Human Parity on Visual Question Answering

Authors

Abstract

The Visual Question Answering (VQA) task combines visual image and language analysis to answer a textual question with respect to an image. It has been a popular research topic with an increasing number of real-world applications in the last decade. This paper introduces a novel hierarchical integration of vision and language, AliceMind-MMU (ALIbaba's Collection of Encoder-decoders from Machine IntelligeNce lab of Damo academy - MultiMedia Understanding), which obtains similar or even slightly better results than human beings do on VQA. A framework is designed to tackle the practical problems of VQA in a cascade manner, including: (1) diverse semantics learning for comprehensive content understanding; (2) enhanced multi-modal pre-training with modality adaptive attention; and (3) a knowledge-guided model with three specialized expert modules for the complex VQA task. Treating different types of questions with the corresponding expertise needed plays an important role in boosting the performance of our architecture up to the human level. An extensive set of experiments is conducted to demonstrate the effectiveness of the new work.
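The abstract's third stage routes different question types to specialized expert modules. A minimal sketch of that idea, dispatching by question type to a per-type expert, is shown below; the classifier, expert names, and rules are hypothetical illustrations, not the paper's actual implementation (the real model learns both the routing and the experts).

```python
# Hypothetical sketch of question-type-based expert routing (not the
# paper's code): each question type is handled by a specialized module.

def classify_question(question: str) -> str:
    """Toy rule-based question-type classifier; the real system learns this."""
    q = question.lower()
    if q.startswith(("how many", "how much")):
        return "counting"
    if "say" in q or "written" in q:
        return "ocr"
    return "general"

# Each expert stands in for a specialized sub-model.
EXPERTS = {
    "counting": lambda image, q: "answer from counting expert",
    "ocr":      lambda image, q: "answer from text-reading expert",
    "general":  lambda image, q: "answer from general VQA expert",
}

def answer(image, question: str) -> str:
    """Dispatch the question to the matching expert module."""
    expert = EXPERTS[classify_question(question)]
    return expert(image, question)

print(answer(None, "How many dogs are in the picture?"))
# -> answer from counting expert
```

The design point is separation of concerns: questions requiring counting, text reading, or general reasoning get models tuned for that skill, rather than one monolithic answerer.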




Related articles

Investigating Embedded Question Reuse in Question Answering

This paper presents a novel method in question answering (QA) that enables a QA system to gain performance by reusing information in the answer to one question to answer another related question. Our analysis shows that a pair of questions in general open-domain QA can have an embedding relation through their mentions of noun phrase expressions. We present methods f...


Revisiting Visual Question Answering Baselines

Visual question answering (VQA) is an interesting learning setting for evaluating the abilities and shortcomings of current systems for image understanding. Many of the recently proposed VQA systems include attention or memory mechanisms designed to support “reasoning”. For multiple-choice VQA, nearly all of these systems train a multi-class classifier on image and question features to predict ...


iVQA: Inverse Visual Question Answering

In recent years, visual question answering (VQA) has become topical as a long-term goal to drive computer vision and multi-disciplinary AI research. The premise of VQA's significance is that both the image and the textual question need to be well understood and mutually grounded in order to infer the correct answer. However, current VQA models perhaps 'understand' less than initially hoped, and in...


Speech-Based Visual Question Answering

This paper introduces the task of speech-based visual question answering (VQA), that is, generating an answer given an image and an associated spoken question. Our work is the first study of speech-based VQA with the intention of providing insights for applications such as speech-based virtual assistants. Two methods are studied: an end-to-end deep neural network that directly uses audio wavef...


Open-Ended Visual Question-Answering

This thesis studies methods to solve Visual Question-Answering (VQA) tasks with a Deep Learning framework. As a preliminary step, we explore Long Short-Term Memory (LSTM) networks used in Natural Language Processing (NLP) to tackle text-based Question-Answering. We then modify the previous model to accept an image as an input in addition to the question. For this purpose, we explore the VGG-1...



Journal

Journal title: ACM Transactions on Information Systems

Year: 2022

ISSN: 1558-1152, 1558-2868, 1046-8188, 0734-2047

DOI: https://doi.org/10.1145/3572833